{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2020 NVIDIA Corporation. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "# =============================================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NVTabular with TensorFlow Feature Columns\n",
    "We'll be basing this example off of [TensorFlow's structured data tutorial](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/feature_columns.ipynb), but we'll show how to move all of its functionality onto the GPU, and even make it robust to larger and more complex use cases (where GPU acceleration will be more pronounced). Let's begin by starting the same way: doing our imports and downloading the data.\n",
    "\n",
    "## VERY IMPORTANT THING TO NOTE\n",
    "Using NVTabular for TensorFlow data loading requires saving some GPU memory for use by the libraries NVT wraps around, namely CuDF. This is slightly non-trivial, since TensorFlow's default behavior is to grab all of the GPU's memory for itself as soon as its context is initialized. In order to ensure this doesn't happen, make sure to set the `TF_MEMORY_ALLOCATION` environment variable up front to the fraction of GPU memory you would like to reserver for TensorFlow and then import the `KerasSequenceLoader` before you do _anything_ with TensorFlow. NVTabular will use that environment variable to properly configure TF to use the memory fraction you dictate (you can also explicitly specify a MB allocation as well)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"TF_MEMORY_ALLOCATION\"] = \"0.5\"\n",
    "\n",
    "import cudf\n",
    "import cupy as cp\n",
    "import nvtabular as nvt\n",
    "from nvtabular.loader.tensorflow import KerasSequenceLoader, KerasSequenceValidater\n",
    "from nvtabular.framework_utils.tensorflow import make_feature_column_workflow, layers\n",
    "\n",
    "import tensorflow as tf\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than use Pandas to open the CSV, we'll use CuDF, a GPU-accelerated dataframe library that works just like Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip\n",
      "1671168/1668792 [==============================] - 0s 0us/step\n"
     ]
    }
   ],
   "source": [
    "import pathlib\n",
    "\n",
    "dataset_url = \"http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip\"\n",
    "data_dir = os.environ.get(\"DATA_DIR\", \"/tmp/datasets/petfinder-mini\")\n",
    "csv_file = os.path.join(data_dir, \"petfinder-mini.csv\")\n",
    "\n",
    "tf.keras.utils.get_file(\"petfinder_mini.zip\", dataset_url, extract=True, cache_dir=\"/tmp\")\n",
    "dataframe = cudf.read_csv(csv_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Age</th>\n",
       "      <th>Breed1</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Color1</th>\n",
       "      <th>Color2</th>\n",
       "      <th>MaturitySize</th>\n",
       "      <th>FurLength</th>\n",
       "      <th>Vaccinated</th>\n",
       "      <th>Sterilized</th>\n",
       "      <th>Health</th>\n",
       "      <th>Fee</th>\n",
       "      <th>Description</th>\n",
       "      <th>PhotoAmt</th>\n",
       "      <th>AdoptionSpeed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Cat</td>\n",
       "      <td>3</td>\n",
       "      <td>Tabby</td>\n",
       "      <td>Male</td>\n",
       "      <td>Black</td>\n",
       "      <td>White</td>\n",
       "      <td>Small</td>\n",
       "      <td>Short</td>\n",
       "      <td>No</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>100</td>\n",
       "      <td>Nibble is a 3+ month old ball of cuteness. He ...</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Cat</td>\n",
       "      <td>1</td>\n",
       "      <td>Domestic Medium Hair</td>\n",
       "      <td>Male</td>\n",
       "      <td>Black</td>\n",
       "      <td>Brown</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Not Sure</td>\n",
       "      <td>Not Sure</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>I just found it alone yesterday near my apartm...</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Dog</td>\n",
       "      <td>1</td>\n",
       "      <td>Mixed Breed</td>\n",
       "      <td>Male</td>\n",
       "      <td>Brown</td>\n",
       "      <td>White</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Yes</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>Their pregnant mother was dumped by her irresp...</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Dog</td>\n",
       "      <td>4</td>\n",
       "      <td>Mixed Breed</td>\n",
       "      <td>Female</td>\n",
       "      <td>Black</td>\n",
       "      <td>Brown</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Short</td>\n",
       "      <td>Yes</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>150</td>\n",
       "      <td>Good guard dog, very alert, active, obedience ...</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Dog</td>\n",
       "      <td>1</td>\n",
       "      <td>Mixed Breed</td>\n",
       "      <td>Male</td>\n",
       "      <td>Black</td>\n",
       "      <td>No Color</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Short</td>\n",
       "      <td>No</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>This handsome yet cute boy is up for adoption....</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11532</th>\n",
       "      <td>Dog</td>\n",
       "      <td>24</td>\n",
       "      <td>Poodle</td>\n",
       "      <td>Male</td>\n",
       "      <td>Brown</td>\n",
       "      <td>Golden</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Not Sure</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>been at my place for a while..am hoping to fin...</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11533</th>\n",
       "      <td>Cat</td>\n",
       "      <td>1</td>\n",
       "      <td>Domestic Short Hair</td>\n",
       "      <td>Female</td>\n",
       "      <td>Cream</td>\n",
       "      <td>Gray</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Short</td>\n",
       "      <td>No</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>1 month old white + grey kitten for adoption n...</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11534</th>\n",
       "      <td>Dog</td>\n",
       "      <td>6</td>\n",
       "      <td>Schnauzer</td>\n",
       "      <td>Female</td>\n",
       "      <td>Black</td>\n",
       "      <td>White</td>\n",
       "      <td>Small</td>\n",
       "      <td>Long</td>\n",
       "      <td>Yes</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>ooooo</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11535</th>\n",
       "      <td>Cat</td>\n",
       "      <td>9</td>\n",
       "      <td>Domestic Short Hair</td>\n",
       "      <td>Female</td>\n",
       "      <td>Yellow</td>\n",
       "      <td>White</td>\n",
       "      <td>Small</td>\n",
       "      <td>Short</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>she is very shy..adventures and independent..s...</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11536</th>\n",
       "      <td>Dog</td>\n",
       "      <td>1</td>\n",
       "      <td>Mixed Breed</td>\n",
       "      <td>Male</td>\n",
       "      <td>Brown</td>\n",
       "      <td>No Color</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Short</td>\n",
       "      <td>No</td>\n",
       "      <td>No</td>\n",
       "      <td>Healthy</td>\n",
       "      <td>0</td>\n",
       "      <td>Fili just loves laying around and also loves b...</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>11537 rows × 15 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Type  Age                Breed1  Gender  Color1    Color2 MaturitySize  \\\n",
       "0      Cat    3                 Tabby    Male   Black     White        Small   \n",
       "1      Cat    1  Domestic Medium Hair    Male   Black     Brown       Medium   \n",
       "2      Dog    1           Mixed Breed    Male   Brown     White       Medium   \n",
       "3      Dog    4           Mixed Breed  Female   Black     Brown       Medium   \n",
       "4      Dog    1           Mixed Breed    Male   Black  No Color       Medium   \n",
       "...    ...  ...                   ...     ...     ...       ...          ...   \n",
       "11532  Dog   24                Poodle    Male   Brown    Golden       Medium   \n",
       "11533  Cat    1   Domestic Short Hair  Female   Cream      Gray       Medium   \n",
       "11534  Dog    6             Schnauzer  Female   Black     White        Small   \n",
       "11535  Cat    9   Domestic Short Hair  Female  Yellow     White        Small   \n",
       "11536  Dog    1           Mixed Breed    Male   Brown  No Color       Medium   \n",
       "\n",
       "      FurLength Vaccinated Sterilized   Health  Fee  \\\n",
       "0         Short         No         No  Healthy  100   \n",
       "1        Medium   Not Sure   Not Sure  Healthy    0   \n",
       "2        Medium        Yes         No  Healthy    0   \n",
       "3         Short        Yes         No  Healthy  150   \n",
       "4         Short         No         No  Healthy    0   \n",
       "...         ...        ...        ...      ...  ...   \n",
       "11532    Medium   Not Sure         No  Healthy    0   \n",
       "11533     Short         No         No  Healthy    0   \n",
       "11534      Long        Yes         No  Healthy    0   \n",
       "11535     Short        Yes        Yes  Healthy    0   \n",
       "11536     Short         No         No  Healthy    0   \n",
       "\n",
       "                                             Description  PhotoAmt  \\\n",
       "0      Nibble is a 3+ month old ball of cuteness. He ...         1   \n",
       "1      I just found it alone yesterday near my apartm...         2   \n",
       "2      Their pregnant mother was dumped by her irresp...         7   \n",
       "3      Good guard dog, very alert, active, obedience ...         8   \n",
       "4      This handsome yet cute boy is up for adoption....         3   \n",
       "...                                                  ...       ...   \n",
       "11532  been at my place for a while..am hoping to fin...         0   \n",
       "11533  1 month old white + grey kitten for adoption n...         1   \n",
       "11534                                              ooooo         1   \n",
       "11535  she is very shy..adventures and independent..s...         3   \n",
       "11536  Fili just loves laying around and also loves b...         1   \n",
       "\n",
       "       AdoptionSpeed  \n",
       "0                  2  \n",
       "1                  0  \n",
       "2                  3  \n",
       "3                  2  \n",
       "4                  2  \n",
       "...              ...  \n",
       "11532              4  \n",
       "11533              3  \n",
       "11534              0  \n",
       "11535              4  \n",
       "11536              3  \n",
       "\n",
       "[11537 rows x 15 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can even use the same scikit-learn API to make a train/validation/test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7383 train examples\n",
      "1846 validation examples\n",
      "2308 test examples\n"
     ]
    }
   ],
   "source": [
    "train, test = train_test_split(dataframe, test_size=0.2)\n",
    "train, val = train_test_split(train, test_size=0.2)\n",
    "print(len(train), 'train examples')\n",
    "print(len(val), 'validation examples')\n",
    "print(len(test), 'test examples')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, however, is where we will begin to diverge. Much of the rest of the code in the original example relies on the fact that your dataset can all be fit into memory at once. If we were willing to restrict ourselves to this use case, we could just replace the preprocessing code from the other notebook with the corresponding CuDF calls in order to move everything on to the GPU:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# In the original dataset \"4\" indicates the pet was not adopted.\n",
    "dataframe[\"target\"] = cp.where(dataframe[\"AdoptionSpeed\"]==4, 0, 1)\n",
    "\n",
    "# Drop un-used columns.\n",
    "dataframe = dataframe.drop(columns=[\"AdoptionSpeed\", \"Description\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could then use an NVTabular `Dataset` object to iterate through these dataframes and pass them to the `KerasSequenceLoader` object, which will take care of passing these dataframes from the land of CuDF into TensorFlow (though we would need to first find a way to transform the string columns, which can't be passed. We'll talk more about this in a bit). The corresponding `df_to_dataset` function for NVTabular would look something like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def df_to_dataset(df, shuffle, batch_size=32):\n",
    "    ds = nvt.Dataset(df)\n",
    "    return KerasSequenceLoader(\n",
    "        ds,\n",
    "        batch_size=batch_size,\n",
    "        shuffle=shuffle,\n",
    "        cat_names=[], # replace these with your categorical feature names,\n",
    "        cont_names=[], # replace these with your continuous feature names,\n",
    "        label_names=[\"target\"]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You could then proceed with training in much the same way as outlined in the original notebook. However, for most use cases of interest, the dataset is much too large to fit into memory all at once. Indeed, as the original notebook notes at the bottom, if your dataset _can_ fit into memory all at once, it's likely too small to justify using deep learning models in the first place (as opposed to more traditional ML models).\n",
    "\n",
    "This larger-than-memory use case is exactly what motivated the creation of NVTabular, so let's show how we can cast _this_ problem in the framework of NVTabular. We'll see that by adding just a few lines of code, we can achieve the same behavior, only with the enormous benefit that the exact same code can be scaled to terabyte sized datasets with almost _no changes_.\n",
    "\n",
    "Let's begin by exporting our split datasets to different files. These individual files could just as well be groups of files corresponding to billions of rows of data. We'll export the dataframes to the parquet data format instead of csv, since this is the format for whcih NVTabular readers are particularly optimized, but csv would work fine as well if we were so inclined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df, split in zip([train, val, test], [\"train\", \"valid\", \"test\"]):\n",
    "    filename = \"{}/{}.parquet\".format(data_dir, split)\n",
    "    df.to_parquet(filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use an NVTabular `Workflow` object to handle the preprocessing step of encoding the target value. Note that because we need to preprocess the `AdoptionSpeed` column, we're going to call it a categorical variable rather than a target, since label preprocessing isn't currently supported in NVTabular. In this preprocessing setting, the distinction is largely arbitrary anyway.\n",
    "\n",
    "We'll make a custom `LambdaOp` which maps to the appropriate target values, and tell it to replace the original column. Therefore, in this workflow, our target column will be `AdoptionSpeed` rather than a newly created `target` column. We manage to drop the `Description` column simply by not including it in our `Workflow`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_names = [\n",
    "    \"Type\", \"Breed1\", \"Color1\", \"Color2\", \"MaturitySize\", \"Gender\",\n",
    "    \"FurLength\", \"Vaccinated\", \"Sterilized\", \"Health\", \"AdoptionSpeed\"\n",
    "]\n",
    "cont_names = [\"Age\", \"Fee\", \"PhotoAmt\"]\n",
    "\n",
    "workflow = nvt.Workflow(cat_names=cat_names, cont_names=cont_names, label_name=[])\n",
    "workflow.add_cat_preprocess(nvt.ops.LambdaOp(\n",
    "    \"encode\",\n",
    "    f=lambda col, gdf: cp.where(col == 4, 0, 1),\n",
    "    columns=[\"AdoptionSpeed\"],\n",
    "    replace=True\n",
    "))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each one of our input datasets (or, rather, the _files_ corresponding to that dataset), we'll loop through it with a `Dataset` object, apply the preprocessing described in our `Workflow`, then export it to a new `processed` directory. If we had statistics we needed to record (because e.g. we want to normalize the continuous columns), we would use the `record_stats` flag on the training set to take care of that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "for split in [\"train\", \"valid\", \"test\"]:\n",
    "    output_path = os.path.join(data_dir, \"processed\", split)\n",
    "    if not os.path.exists(output_path):\n",
    "        os.makedirs(output_path)\n",
    "\n",
    "    dataset = nvt.Dataset(os.path.join(data_dir, split+\".parquet\"))\n",
    "    shuffle = nvt.io.Shuffle.PER_WORKER if split == \"train\" else False\n",
    "    workflow.apply(dataset, output_path=output_path, out_files_per_proc=1, record_stats=(split==\"train\"), shuffle=shuffle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Columns and Online Preprocessing\n",
    "And that's it! Now we can proceed largely as we do in the original notebook, building the requisite TensorFlow feature columns to represent both the inputs to our network and online preprocessing transformations we would like to apply to them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "animal_type = tf.feature_column.categorical_column_with_vocabulary_list(\n",
    "      'Type', ['Cat', 'Dog'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_columns = []\n",
    "\n",
    "# numeric cols\n",
    "for header in ['PhotoAmt', 'Fee', 'Age']:\n",
    "    feature_columns.append(tf.feature_column.numeric_column(header))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# bucketized cols\n",
    "age = tf.feature_column.numeric_column('Age')\n",
    "age_buckets = tf.feature_column.bucketized_column(age, boundaries=[1, 2, 3, 4, 5])\n",
    "feature_columns.append(age_buckets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# indicator_columns\n",
    "indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',\n",
    "                          'FurLength', 'Vaccinated', 'Sterilized', 'Health']\n",
    "for col_name in indicator_column_names:\n",
    "    categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(\n",
    "        col_name, dataframe[col_name].unique().to_pandas())\n",
    "    indicator_column = tf.feature_column.indicator_column(categorical_column)\n",
    "    feature_columns.append(indicator_column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# embedding columns\n",
    "breed1 = tf.feature_column.categorical_column_with_vocabulary_list(\n",
    "      'Breed1', dataframe.Breed1.unique().to_pandas())\n",
    "breed1_embedding = tf.feature_column.embedding_column(breed1, dimension=8)\n",
    "feature_columns.append(breed1_embedding)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# crossed columns\n",
    "age_type_feature = tf.feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)\n",
    "feature_columns.append(tf.feature_column.indicator_column(age_type_feature))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Except that's still not quite it, because the actual implementation of these online transformations is really suboptimal on GPU for reasons that I go over in excruciating detail in the [_accelerating-tensorflow_](./accelerating_tensorflow.ipynb) notebook. Which is too bad, not only because feature columns are a convenient input representation, but also because if you're a seasoned TF user for tabular data, you probably already have these set up! The last thing I want to make you do is rewrite a bunch of code you've spent months getting together.\n",
    "\n",
    "That's why NVTabular provides a utility `make_feature_column_workflow` which will read through all the feature columns you use, figure out what online preprocessing steps they apply, the recreate them with an equivalent NVTabular `Workflow` that can be mapped on to the `KerasSequenceLoader` we use to load our data in order to be applied online! It will even return a simplified set of feature columns which represent all of the outputs from that `Workflow`. These can be used to instantiate the Keras `DenseFeatures` layer or, even better, since NVTabular is now handling all the online preprocessing, you can use another utility provided by NVTabular, the `ScalarDenseFeatures` layer, to map from these tensors to a dense network input!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "online_workflow, feature_columns = make_feature_column_workflow(feature_columns, \"AdoptionSpeed\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_layer = layers.DenseFeatures(feature_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can build the dataset-construction function from before, but instead of reading from an in-memory data frame we'll read from our exported parquet datasets (which we'll load in an online fashion). We'll then map this online workflow onto these datasets using the `Dataset.map` method (which, unlike the TensorFlow `Dataset.map` function, operates in-place, just in case you try to return it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# I would be remiss if I didn't note here that if you want to\n",
    "# be using GPUs for tabular deep learning, you should be using\n",
    "# MUCH larger batch sizes than this (think about another 10\n",
    "# factors of 2). However, we'll use 32 for parity with the original\n",
    "# notebook for illustrative purposes only\n",
    "def make_nvt_dataset(split, shuffle=True, batch_size=32):\n",
    "    ds = KerasSequenceLoader(\n",
    "        os.path.join(data_dir, \"processed\", split, \"*.parquet\"),\n",
    "        batch_size=batch_size,\n",
    "        feature_columns=feature_columns,\n",
    "        label_names=[\"AdoptionSpeed\"],\n",
    "        shuffle=shuffle,\n",
    "        buffer_size=0.02,\n",
    "        parts_per_chunk=1\n",
    "    )\n",
    "    ds.map(online_workflow)\n",
    "    return ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_ds = make_nvt_dataset(\"train\", shuffle=True)\n",
    "val_ds = make_nvt_dataset(\"valid\", shuffle=False, batch_size=2048)\n",
    "test_ds = make_nvt_dataset(\"test\", shuffle=False, batch_size=4096)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last difference here is that since our `KerasSequenceLoader` returns a `dict` of `{tensor_name: tensor}` pairs, we can't use the `tf.keras.Sequential` object like the original notebook does. Instead, we'll just explicitly create the input tensors using our simplified feature columns then chain the layers together manually. This Keras Functional API approach is much more flexible, and enables the use of models that don't have straightforward, linear paths through them.\n",
    "\n",
    "Note also that rather than pass `val_ds` to the `validation_data` kwarg of `model.fit`, we pass the `KerasSequenceValidater` to the `callbacks` kwarg. The reason for this is that Keras doesn't support its `Sequence` object (from which `KerasSequenceLoader` inherits) as validation data, which complicates the picture for validation datasets that can't fit into memory. This custom callback takes care of calculating online validation metrics and passing them to the model's logs to be fed to downstream callbacks like the print logger. As such, it's always good practice to put it first in your `callbacks` list if you have others that rely on validation metric information (e.g. early stopping or learning rate decay)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5855 - accuracy: 0.6691 - val_loss: 0.6302 - val_accuracy: 0.6730\n",
      "Epoch 2/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5531 - accuracy: 0.6896 - val_loss: 0.6061 - val_accuracy: 0.6882\n",
      "Epoch 3/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5428 - accuracy: 0.6871 - val_loss: 0.6024 - val_accuracy: 0.6876\n",
      "Epoch 4/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5358 - accuracy: 0.6936 - val_loss: 0.5971 - val_accuracy: 0.6940\n",
      "Epoch 5/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5290 - accuracy: 0.7011 - val_loss: 0.5931 - val_accuracy: 0.7003\n",
      "Epoch 6/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5228 - accuracy: 0.7000 - val_loss: 0.5921 - val_accuracy: 0.7001\n",
      "Epoch 7/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5185 - accuracy: 0.7054 - val_loss: 0.5900 - val_accuracy: 0.7047\n",
      "Epoch 8/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5144 - accuracy: 0.7080 - val_loss: 0.5851 - val_accuracy: 0.7073\n",
      "Epoch 9/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5128 - accuracy: 0.7093 - val_loss: 0.5868 - val_accuracy: 0.7089\n",
      "Epoch 10/10\n",
      "231/231 [==============================] - 1s 6ms/step - loss: 0.5086 - accuracy: 0.7175 - val_loss: 0.5882 - val_accuracy: 0.7161\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.keras.callbacks.History at 0x7fac0652c950>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputs = {}\n",
    "for column in feature_columns:\n",
    "    column = getattr(column, \"categorical_column\", column)\n",
    "    dtype = getattr(column, \"dtype\", tf.int64)\n",
    "    inputs[column.key] = tf.keras.Input(name=column.key, shape=(1,), dtype=dtype)\n",
    "\n",
    "x = feature_layer(inputs)\n",
    "x = tf.keras.layers.Dense(128, activation=\"relu\")(x)\n",
    "x = tf.keras.layers.Dense(128, activation=\"relu\")(x)\n",
    "x = tf.keras.layers.Dropout(.1)(x)\n",
    "x = tf.keras.layers.Dense(1)(x)\n",
    "model = tf.keras.Model(inputs=inputs.values(), outputs=x)\n",
    "\n",
    "model.compile(\n",
    "    optimizer=\"adam\",\n",
    "    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n",
    "    metrics=[\"accuracy\"]\n",
    ")\n",
    "\n",
    "model.fit(\n",
    "    train_ds,\n",
    "    callbacks=[KerasSequenceValidater(val_ds)],\n",
    "    epochs=10\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1/1 [==============================] - 0s 1ms/step - loss: 0.5110 - accuracy: 0.7140\n",
      "Accuracy 0.7140381336212158\n"
     ]
    }
   ],
   "source": [
    "loss, accuracy = model.evaluate(test_ds)\n",
    "print(\"Accuracy\", accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So there it is, end-to-end GPU-accelerated deep learning model development in TensorFlow with the addition of just a couple lines of code. And the best part, as we discussed above, is that this will scale to gigantic datasets and complex models with almost _no_ changes! Fast, simple, and scalable.\n",
    "\n",
    "For more examples of how to do this, take a look at the [_accelerating-tensorflow_](./accelerating-tensorflow.ipynb) notebook for training and the [_criteo-example_](../criteo-example.ipynb) notebook for data preprocessing."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
